EDA: preparation

First, we split the dataset into three subsets (S1, S2 and S3) for further analysis.

  • S1: Historic station production data
  • S2: PCA data (upper part, up to Row 5113)
  • S3: PCA data (lower part, from Row 5114)
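# S1: station production columns, keeping only rows with complete production data
dt_s1_orig <- as.data.frame(na.omit(dt[, 2:99]))
# S2: PCA columns for the rows covered by S1 (upper part)
dt_s2_orig <- as.data.frame(dt[1:nrow(dt_s1_orig), 100:456])
# S3: PCA columns for the remaining rows, starting one row after S2 ends
dt_s3_orig <- as.data.frame(dt[(nrow(dt_s2_orig) + 1):nrow(dt), 100:456])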

[1.1] Compute Statistics

We computed the same summary statistics for S1, S2 and S3 separately.

S1 is the most relevant subset at this point; its statistics begin as follows:

head(statistics_s1)
##    Station     Mean      Sd    Min  Sec_qrt   Median Fourth_qrt      Max
## 1:    ACME 16877462 7869606  12000 11404200 16946400   23734800 31347900
## 2:    ADAX 16237534 7905850 510000 10611000 16299300   23027400 31227000
## 3:    ALTU 17119189 7702989    900 11674500 17073600   23903700 31411500
## 4:    APAC 17010565 7883455   3300 11637000 17062500   23909400 31616100
## 5:    ARNE 17560173 7917965 477300 11666400 17578500   24503700 32645700
## 6:    BEAV 17612143 7911267    300 11493600 17520900   24683100 32884800

As mentioned before, we ran the same analysis on S2 and S3, which produced tables with the same structure.
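
The per-station summary could be computed along the following lines. This is a minimal base-R sketch on made-up data, not the code we actually ran: `toy_s1`, `summarise_column` and `toy_statistics` are hypothetical names, and we assume here that the two quartile columns correspond to the 25% and 75% quantiles.

```r
# Hypothetical toy data: rows are days, columns are stations.
set.seed(1)
toy_s1 <- data.frame(
  ACME = runif(100, 0, 3e7),
  ADAX = runif(100, 0, 3e7)
)

# Summarise one numeric column.
# Assumption: the quartile columns are the 25% and 75% quantiles.
summarise_column <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(Mean   = mean(x, na.rm = TRUE),
    Sd     = sd(x, na.rm = TRUE),
    Min    = min(x, na.rm = TRUE),
    Q1     = q[1],
    Median = q[2],
    Q3     = q[3],
    Max    = max(x, na.rm = TRUE))
}

# One row per station, one column per statistic.
stats_mat <- t(sapply(toy_s1, summarise_column))
toy_statistics <- data.frame(Station = rownames(stats_mat), stats_mat,
                             row.names = NULL)
head(toy_statistics)
```

Because the function works on a single column, the same call can be reused unchanged for the S2 and S3 columns.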


[1.2] Outlier detection

To identify outliers in the dataset, we applied a 1.5 * IQR benchmark to the Min and Max of each station, checking the stations one by one in a for-loop.

The check found 0 outliers in the S1 dataset under this benchmark. We therefore saw no need to adjust the dataset.
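
One way to express such a check is sketched below. It is an illustration under assumptions, not our original loop: `count_iqr_outliers` and the toy data are hypothetical, and we assume the fences sit at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.

```r
# Count, per column, the values outside the 1.5 * IQR fences.
# Assumption: fences are Q1 - k * IQR and Q3 + k * IQR.
count_iqr_outliers <- function(dat, k = 1.5) {
  counts <- integer(ncol(dat))
  names(counts) <- names(dat)
  for (col in names(dat)) {
    x <- dat[[col]]
    q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    fence <- k * (q[2] - q[1])
    counts[col] <- sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
  }
  counts
}

# Toy data: column B contains one obviously extreme value.
toy <- data.frame(
  A = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  B = c(1, 2, 3, 4, 5, 6, 7, 8, 100)
)
outliers <- count_iqr_outliers(toy)
outliers
```

If the minimum and maximum of a column both lie inside the fences, the count for that column is necessarily zero, which is why checking Min and Max is sufficient.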


[1.3] Statistical Visualization

In this section, we first mapped the stations using Leaflet; we then plotted the statistics computed above.


Leaflet map

We clustered the stations by production mean:

  • Blue: Mean >17m
  • Green: Mean between 16m and 17m
  • Red: Mean <16m
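
The colour assignment above can be sketched with `cut()`. The three means are taken from the tables in this report; `toy_means` is a hypothetical name, and how values exactly on a threshold are treated is an assumption (here they fall into the lower group).

```r
# Three stations with their production means, as reported in this document.
toy_means <- data.frame(
  Station = c("BOIS", "ACME", "CLAY"),
  Mean    = c(18688943, 16877462, 15486166)
)

# Assign each station to a colour group by its mean.
# Assumption: a mean exactly equal to a threshold goes to the lower group.
toy_means$Group <- cut(
  toy_means$Mean,
  breaks = c(-Inf, 16e6, 17e6, Inf),
  labels = c("Red", "Green", "Blue")
)
toy_means
```

The resulting factor can then be used directly as the marker colour when drawing the map.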

Group details
Blue
head(blue)
##    Station     Mean
## 1:    ALTU 17119189
## 2:    ARNE 17560173
## 3:    BEAV 17612143
## 4:    BESS 17304074
## 5:    BOIS 18688943
## 6:    BUFF 17304512
nrow(blue)
## [1] 23

Selected Example for further analysis: GOOD


Green
head(green)
##    Station     Mean
## 1:    ACME 16877462
## 2:    ADAX 16237534
## 3:    BIXB 15969634
## 4:    BLAC 16061707
## 5:    BOWL 16034081
## 6:    BREC 16655129
nrow(green)
## [1] 47

Selected Example for further analysis: ACME


Red
head(red)
##    Station     Mean
## 1:    CLAY 15486166
## 2:    CLOU 15656934
## 3:    COOK 15667280
## 4:    COPA 15896897
## 5:    EUFA 15718914
## 6:    IDAB 15849510
nrow(red)
## [1] 22

Selected Example for further analysis: STIG


Visualization of S1

We first plotted all stations by their mean daily production and labelled the high and low performers with their station names.


Next, we visualized the standard deviation of production for each station, again labelling the high and low performers with their station names.



Visualization of S2 & S3
S2 Density


S3 Density


[1.4] Data Scaling

We then standardized the data to zero mean and unit standard deviation:

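# Standardize a numeric vector to zero mean and unit standard deviation.
tipify <- function(x) {
  mu <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  s[s == 0] <- 1  # constant input: avoid dividing by zero
  (x - mu) / s    # return the scaled vector visibly
}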
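
As an illustration of how such a function is applied column by column, here is a self-contained sketch on toy data. `standardize_column` is a hypothetical stand-in with the same logic as the function above, and `toy` is made up.

```r
# Hypothetical stand-in with the same logic as the scaling function above.
standardize_column <- function(x) {
  mu <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  if (is.na(s) || s == 0) s <- 1  # constant column: avoid dividing by zero
  (x - mu) / s
}

# Toy data: column b is constant, so its standard deviation is zero.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5))

# Scale every column independently.
toy_scaled <- as.data.frame(lapply(toy, standardize_column))
toy_scaled
```

After scaling, a non-constant column has mean 0 and standard deviation 1, while a constant column becomes all zeros instead of producing NaN.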

We then plotted the resulting densities:


S1


Visualization of S2 & S3
S2 Density


S3 Density


[1.5] Correlation detection

We analyzed (a) correlations between the solar panels and the PCA variables and (b) correlations among the explanatory variables themselves.


Correlations between solar panels and PCA

We identified PCA 1 as the most strongly correlated explanatory variable for 97 of the 98 solar panels.

The exception was station #60, MTHE, for which PCA 2 showed the strongest correlation.

head(corr_main)
##    Station Corr_PCA
## 1:    ACME        1
## 2:    ADAX        1
## 3:    ALTU        1
## 4:    APAC        1
## 5:    ARNE        1
## 6:    BEAV        1
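
A table of this shape can be derived as sketched below. The data are synthetic and the names `toy_pca`, `toy_stations` and `toy_corr_main` are hypothetical; we also assume that "most relevant" means the largest absolute Pearson correlation.

```r
# Synthetic data: two stations driven by different components.
set.seed(42)
n <- 200
toy_pca <- data.frame(PCA1 = rnorm(n), PCA2 = rnorm(n), PCA3 = rnorm(n))
toy_stations <- data.frame(
  AAAA = 3 * toy_pca$PCA1 + rnorm(n, sd = 0.5),
  BBBB = -2 * toy_pca$PCA2 + rnorm(n, sd = 0.5)
)

# Rows are stations, columns are PCA variables.
corr_mat <- cor(toy_stations, toy_pca)

# Assumption: pick the component with the largest absolute correlation.
toy_corr_main <- data.frame(
  Station  = rownames(corr_mat),
  Corr_PCA = apply(abs(corr_mat), 1, which.max),
  row.names = NULL
)
toy_corr_main
```

Taking the absolute value matters here: station BBBB is negatively correlated with PCA2, and would be missed if only positive correlations were compared.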

Correlation between explanatory variables themselves
max(max_corr_per_PCA);
## [1] 0.07248046

The highest absolute correlation coefficient between any PCA and the other PCAs is 0.072. With this approach, there is therefore not enough evidence of correlation to treat any PCA as redundant.
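
A vector such as `max_corr_per_PCA` could be built as follows. This is a sketch on random data, with `toy_pca` and `toy_max_corr_per_pca` as hypothetical names; the essential step is to exclude the diagonal, since every variable correlates perfectly with itself.

```r
# Synthetic stand-in for the PCA columns.
set.seed(7)
toy_pca <- as.data.frame(matrix(rnorm(500 * 4), ncol = 4))
names(toy_pca) <- paste0("PCA", 1:4)

# Absolute pairwise correlations, ignoring each variable's self-correlation.
abs_corr <- abs(cor(toy_pca))
diag(abs_corr) <- NA

# For each PCA, the strongest absolute correlation with any other PCA.
toy_max_corr_per_pca <- apply(abs_corr, 1, max, na.rm = TRUE)
max(toy_max_corr_per_pca)
```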


[1.6] Visualization of correlation

To better understand the analysis above, we visualized the findings.


Visualization solar panel to PCA correlation
Table


Histogram


Visualization PCA to PCA correlation
Table


Histogram


[1.7] Dimensionality reduction

We used the following code for the dimensionality reduction:

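library(caret)       # provides filterVarImp()
library(data.table)

# Return the names of the n_vars predictors in dat that rank highest for target y.
select_important <- function(dat, n_vars, y) {
  varimp <- filterVarImp(x = dat, y = y, nonpara = TRUE)
  varimp <- data.table(variable = rownames(varimp), imp = varimp[, 1])
  varimp <- varimp[order(-imp)]  # most important first
  selected <- varimp$variable[1:n_vars]
  return(selected)
}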

We ran the analysis for three solar panels (Blue: GOOD, Green: ACME, Red: STIG), one from each of the three groups identified in section [1.3]. As the calculation requires large amounts of CPU and memory (and therefore time), we post the output here instead.

  • Blue [GOOD]:

         [1] "PCA1" "PCA2" "PCA4" "PCA5" "PCA3" "PCA6" "PCA7" "PCA24" "PCA9" "PCA32"

  • Green [ACME]:

         [1] "PCA1" "PCA2" "PCA7" "PCA4" "PCA6" "PCA5" "PCA24" "PCA32" "PCA26" "PCA35"

  • Red [STIG]:

         [1] "PCA1" "PCA2" "PCA7" "PCA4" "PCA5" "PCA6" "PCA24" "PCA26" "PCA32" "PCA33"


Group E - Datadores